Distributed, Parallel, and Cluster Computing 3
♻ ☆ FlowTracer: A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters
Hasibul Jamil, Abdul Alim, Laurent Schares, Pavlos Maniotis, Liran Schour, Ali Sydney, Abdullah Kayi, Tevfik Kosar, Bengi Karacali
The increasing complexity of AI workloads, especially distributed Large
Language Model (LLM) training, places significant strain on the networking
infrastructure of parallel data centers and supercomputing systems. While
Equal-Cost Multi-Path (ECMP) routing distributes traffic over parallel paths,
hash collisions often lead to imbalanced network resource utilization and
performance bottlenecks. This paper presents FlowTracer, a tool designed to
analyze network path utilization and evaluate different routing strategies.
FlowTracer aids in debugging network inefficiencies by providing detailed
visibility into traffic distribution and helping to identify the root causes of
performance degradation, such as issues caused by hash collisions. By offering
flow-level insights, FlowTracer enables system operators to optimize routing,
reduce congestion, and improve the performance of distributed AI workloads. We
use a RoCEv2-enabled cluster with a leaf-spine network and sixteen 400-Gbps nodes to
demonstrate how FlowTracer can be used to compare the flow imbalances of ECMP
routing against a statically configured network. The example showcases a 30%
reduction in imbalance, as measured by a new metric we introduce.
comment: Submitted for peer review to IEEE ICC 2025
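To make the ECMP hash-collision problem concrete, the following is a minimal, self-contained sketch: flows are assigned to parallel paths by hashing their identifiers, and collisions leave some paths overloaded while others sit idle. The max-over-mean imbalance measure here is purely illustrative; the abstract does not specify FlowTracer's actual metric, and the flow tuples and path count are assumptions.

```python
import hashlib
import statistics

def ecmp_path(flow, num_paths):
    """Pick a path by hashing the flow identifier, as ECMP routers do."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def load_imbalance(loads):
    """Illustrative imbalance measure: max path load over mean path load.
    (Not FlowTracer's metric, which the abstract does not define.)"""
    mean = statistics.mean(loads)
    return max(loads) / mean if mean else 0.0

# 16 flows hashed onto 8 parallel paths: hash collisions typically leave
# some paths carrying several flows while others carry none.
num_paths = 8
flows = [("10.0.0.%d" % src, "10.0.1.1", 4791) for src in range(16)]
loads = [0] * num_paths
for f in flows:
    loads[ecmp_path(f, num_paths)] += 1

print("per-path flow counts:", loads)
print("imbalance (max/mean):", load_imbalance(loads))
```

A value of 1.0 would mean perfectly even spreading; anything above it quantifies how much the busiest path exceeds the average, which is the kind of flow-level visibility the tool provides.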
♻ ☆ Efficient Scheduling of Vehicular Tasks on Edge Systems with Green Energy and Battery Storage
The autonomous vehicle industry is rapidly expanding, requiring significant
computational resources for tasks like perception and decision-making.
Vehicular edge computing has emerged to meet this need, utilizing roadside
computational units (roadside edge servers) to support autonomous vehicles.
Aligning with the trend of green cloud computing, these roadside edge servers
often get energy from solar power. Additionally, each roadside computational
unit is equipped with a battery for storing solar power, ensuring continuous
computational operation during periods of low solar energy availability.
In our research, we address the scheduling of computational tasks generated
by autonomous vehicles onto roadside units whose power consumption is
proportional to the cube of the server's computational load. Each computational task is
associated with a revenue, dependent on its computational needs and deadline.
Our objective is to maximize the total revenue of the system of roadside
computational units.
We propose an offline heuristic approach based on predicted solar energy and
incoming task patterns for different time slots. Additionally, we present
heuristics that adapt in real time when the actual solar energy and task
arrivals deviate from these predictions. Our comparative analysis shows
that our methods outperform state-of-the-art approaches by up to 40% on
real-life datasets.
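The cubic power model implies that each additional unit of load costs more energy than the last, so which tasks to admit under a battery budget is a nontrivial choice. The following is a small illustrative sketch, not the paper's algorithm: a greedy heuristic that ranks tasks by revenue against their standalone energy cost and admits them while the marginal (cubic) energy fits the budget. The task values, the unit power coefficient, and the budget are all assumptions.

```python
def power(load, coeff=1.0):
    """Server power draw, proportional to the cube of its load."""
    return coeff * load ** 3

def greedy_schedule(tasks, energy_budget, slot_len=1.0):
    """Admit (revenue, cpu_demand) tasks greedily under an energy budget.

    Because power is convex in load, the marginal energy of a task grows
    with the load already placed; revenue per standalone energy is used
    here as a cheap ranking proxy.
    """
    accepted, load, used = [], 0.0, 0.0
    for rev, demand in sorted(tasks, key=lambda t: t[0] / power(t[1]),
                              reverse=True):
        delta = (power(load + demand) - power(load)) * slot_len
        if used + delta <= energy_budget:
            accepted.append((rev, demand))
            load += demand
            used += delta
    return accepted, used

# Hypothetical tasks: (revenue, cpu demand), with 8 units of battery energy.
tasks = [(10.0, 1.0), (6.0, 0.5), (9.0, 2.0), (3.0, 0.4)]
accepted, used = greedy_schedule(tasks, energy_budget=8.0)
print("total revenue:", sum(r for r, _ in accepted))  # 19.0 on this input
```

On this input the (9.0, 2.0) task is rejected: adding 2 units of load to an already-loaded server costs far more energy than the same task on an idle one, which is exactly the effect the cubic model captures.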
♻ ☆ PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures
Christina Giannoula, Peiming Yang, Ivan Fernandez Vega, Jiacheng Yang, Sankeerth Durvasula, Yu Xin Li, Mohammad Sadrosadati, Juan Gomez Luna, Onur Mutlu, Gennady Pekhimenko
Graph Neural Networks (GNNs) are emerging ML models for analyzing
graph-structured data. GNN execution involves both
compute-intensive and memory-intensive kernels; the latter dominate the total
time, being significantly bottlenecked by data movement between memory and
processors. Processing-In-Memory (PIM) systems can alleviate this data movement
bottleneck by placing simple processors near or inside memory arrays. In
this work, we introduce PyGim, an efficient ML library that accelerates GNNs on
real PIM systems. We propose intelligent parallelization techniques for
memory-intensive kernels of GNNs tailored for real PIM systems, and develop a
handy Python API for them. We provide hybrid GNN execution, in which the
compute-intensive and memory-intensive kernels are executed in
processor-centric and memory-centric computing systems, respectively. We
extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using
emerging GNN models, and demonstrate that it outperforms its state-of-the-art
CPU counterpart on Intel Xeon by 3.04x on average, and achieves higher resource
utilization than CPU and GPU systems. Our work provides useful recommendations
for software, system and hardware designers. PyGim is publicly available at
https://github.com/CMU-SAFARI/PyGim.
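The hybrid split described above can be sketched in miniature: a GNN layer decomposes into a memory-bound neighbor aggregation (a sparse-matrix-dense-matrix product, the kind of kernel PyGim offloads to PIM cores) and a compute-bound dense transform (kept on the processor side). This pure-Python stand-in only illustrates the decomposition; it is not PyGim's API, and the tiny graph and identity weight are assumptions.

```python
def aggregate(edges, features):
    """Sparse neighbor aggregation (SpMM): sum each node's incoming
    neighbor features. Memory-bound -- the kernel class mapped to PIM."""
    out = [[0.0] * len(features[0]) for _ in features]
    for src, dst in edges:  # one nonzero of the adjacency matrix at a time
        for k, v in enumerate(features[src]):
            out[dst][k] += v
    return out

def combine(hidden, weight):
    """Dense transform (GEMM): compute-bound, left on the CPU/GPU side."""
    return [[sum(h[k] * weight[k][j] for k in range(len(weight)))
             for j in range(len(weight[0]))]
            for h in hidden]

# Tiny 3-node graph with edges 0->1, 2->1, 1->2 and 2-dim features;
# an identity weight matrix keeps the combine step easy to check.
edges = [(0, 1), (2, 1), (1, 2)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ident = [[1.0, 0.0], [0.0, 1.0]]
out = combine(aggregate(edges, feats), ident)
print(out)  # node 1 accumulates the features of nodes 0 and 2
```

The aggregation step touches memory irregularly (one adjacency nonzero per update) while the combine step is a regular dense multiply, which is why executing them on memory-centric and processor-centric hardware, respectively, plays to each system's strength.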